matthews_corrcoef (Matthews Correlation Coefficient, MCC)#
The Matthews correlation coefficient (MCC) is a single-number summary of a classifier’s confusion matrix. It can be interpreted as the Pearson correlation between true and predicted labels, so it naturally lives in \([-1, 1]\):
\(+1\): perfect predictions
\(0\): no better than random (no correlation)
\(-1\): perfectly wrong (systematic inversion)
MCC is especially useful when classes are imbalanced, because it uses all four confusion-matrix entries (TP, TN, FP, FN).
Learning goals#
derive MCC from the confusion matrix and from Pearson correlation
implement MCC from scratch in NumPy (binary + multiclass)
build intuition with Plotly visuals (imbalance + thresholding)
use MCC to select a decision threshold / tune a simple model
Table of contents#
Confusion matrix recap
Binary MCC: definition + correlation view
Multiclass MCC
NumPy implementation (from scratch)
Intuition plots (TPR/TNR surface + imbalance trap)
Using MCC for optimization: threshold tuning for logistic regression
Pros, cons, and when to use MCC
Exercises + references
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
1) Confusion matrix recap#
For binary classification, assume the positive class is labeled \(1\) and the negative class is labeled \(0\).
A confusion matrix counts outcomes:
|  | predicted \(1\) | predicted \(0\) |
|---|---|---|
| true \(1\) | TP | FN |
| true \(0\) | FP | TN |
With total sample size \(N = \text{TP} + \text{TN} + \text{FP} + \text{FN}\).
Useful rates:
TPR / recall / sensitivity: \(\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}\)
TNR / specificity: \(\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}\)
MCC “wants” both TPR and TNR to be high.
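As a quick sketch of these rates, here is a tiny example with hypothetical counts (the numbers are made up purely for illustration):

# hypothetical confusion counts, just for illustration
tp, fn, fp, tn = 40, 10, 5, 45

tpr = tp / (tp + fn)  # recall / sensitivity
tnr = tn / (tn + fp)  # specificity
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")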
2) Binary MCC: definition + correlation view#
2.1 Definition (confusion-matrix form)#
The (binary) Matthews correlation coefficient is

\[
\text{MCC} = \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}
{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}
\]
The numerator rewards agreement (TP·TN) and penalizes disagreement (FP·FN).
The denominator normalizes to keep the score in \([-1, 1]\).
If the denominator is \(0\) (e.g. constant predictions, or all labels are the same), MCC is mathematically undefined. In practice (and in scikit-learn), it is returned as 0.0.
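A minimal sketch of the three regimes (perfect, inverted, and constant predictions), using scikit-learn's matthews_corrcoef to show the 0.0-by-convention behavior; the label array here is illustrative:

import numpy as np
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef

y = np.array([1, 1, 0, 0, 1, 0])

print(sk_matthews_corrcoef(y, y))                 # perfect predictions  -> +1.0
print(sk_matthews_corrcoef(y, 1 - y))             # inverted predictions -> -1.0
print(sk_matthews_corrcoef(y, np.zeros_like(y)))  # constant predictions -> 0.0 by convention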
2.2 MCC = Pearson correlation for 0/1 labels#
Let \(Y, \hat Y \in \{0,1\}\) be the true and predicted labels. The Pearson correlation is

\[
\rho(Y, \hat Y) = \frac{\operatorname{Cov}(Y, \hat Y)}{\sqrt{\operatorname{Var}(Y)\,\operatorname{Var}(\hat Y)}}
= \frac{\mathbb{E}[Y\hat Y] - \mathbb{E}[Y]\,\mathbb{E}[\hat Y]}{\sqrt{\operatorname{Var}(Y)\,\operatorname{Var}(\hat Y)}}
\]
Using the contingency table above:
\(\mathbb{E}[Y] = \frac{\text{TP}+\text{FN}}{N}\)
\(\mathbb{E}[\hat Y] = \frac{\text{TP}+\text{FP}}{N}\)
\(\mathbb{E}[Y\hat Y] = \frac{\text{TP}}{N}\)
So

\[
\operatorname{Cov}(Y, \hat Y) = \mathbb{E}[Y\hat Y] - \mathbb{E}[Y]\,\mathbb{E}[\hat Y]
= \frac{\text{TP}}{N} - \frac{(\text{TP}+\text{FN})(\text{TP}+\text{FP})}{N^2}
= \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}{N^2}
\]

And, since both variables are Bernoulli,

\[
\operatorname{Var}(Y) = \frac{(\text{TP}+\text{FN})(\text{TN}+\text{FP})}{N^2},
\qquad
\operatorname{Var}(\hat Y) = \frac{(\text{TP}+\text{FP})(\text{TN}+\text{FN})}{N^2}
\]
Plugging these into \(\rho\) yields the MCC formula. This is why MCC is also known as the phi coefficient (correlation for two binary variables).
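A quick numerical sanity check of this equivalence on synthetic 0/1 labels (the arrays below are illustrative): the Pearson correlation of the two label vectors matches MCC up to floating-point error.

import numpy as np
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef

rng_check = np.random.default_rng(0)
y_true_chk = rng_check.integers(0, 2, size=1000)
# predictions agree with the truth ~80% of the time
y_pred_chk = np.where(rng_check.random(1000) < 0.8, y_true_chk, 1 - y_true_chk)

pearson = np.corrcoef(y_true_chk, y_pred_chk)[0, 1]
print(pearson, sk_matthews_corrcoef(y_true_chk, y_pred_chk))  # the two values agree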
3) Multiclass MCC#
MCC has a natural multiclass extension based on the full \(K\times K\) confusion matrix.
Let \(C\in\mathbb{N}^{K\times K}\) be the confusion matrix, where \(C_{ij}\) counts the samples with true class \(i\) and predicted class \(j\).
Define:
\(s = \sum_{i,j} C_{ij}\) (total)
\(c = \sum_k C_{kk}\) (correct / trace)
\(t_k = \sum_j C_{k j}\) (true count per class; row sums)
\(p_k = \sum_i C_{i k}\) (predicted count per class; column sums)
Then the multiclass MCC is:

\[
\text{MCC} = \frac{c\,s - \sum_k t_k\,p_k}
{\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}}
\]
It reduces to the binary formula when \(K=2\), and can be viewed as a correlation between one-hot encodings of \(y\) and \(\hat y\).
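To make the one-hot correlation view concrete, here is a sketch (the helper mcc_one_hot is our own name, not a library function) that centers the one-hot indicator matrices and correlates them entry-wise; it should agree with scikit-learn's multiclass MCC.

import numpy as np
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef

def mcc_one_hot(y_true, y_pred, n_classes):
    # one-hot encode both label vectors into (N, K) indicator matrices
    X = np.eye(n_classes)[np.asarray(y_true)]
    Y = np.eye(n_classes)[np.asarray(y_pred)]
    # center each column, then correlate the two matrices entry-wise
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    denom = np.sqrt((Xc**2).sum() * (Yc**2).sum())
    return 0.0 if denom == 0 else float((Xc * Yc).sum() / denom)

rng_oh = np.random.default_rng(1)
y_true_oh = rng_oh.integers(0, 3, size=300)
y_pred_oh = np.where(rng_oh.random(300) < 0.7, y_true_oh, rng_oh.integers(0, 3, size=300))
print(mcc_one_hot(y_true_oh, y_pred_oh, 3), sk_matthews_corrcoef(y_true_oh, y_pred_oh))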
4) NumPy implementation (from scratch)#
We’ll implement:
a simple confusion matrix builder
MCC for binary and multiclass using the \(K\times K\) formula
(optionally) the binary closed-form as a sanity check
def confusion_matrix_np(y_true, y_pred, labels=None):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have the same shape")
if labels is None:
labels = np.unique(np.concatenate([y_true, y_pred]))
else:
labels = np.asarray(labels)
label_to_index = {label: i for i, label in enumerate(labels.tolist())}
true_idx = np.fromiter((label_to_index.get(v, -1) for v in y_true), dtype=int, count=y_true.size)
pred_idx = np.fromiter((label_to_index.get(v, -1) for v in y_pred), dtype=int, count=y_pred.size)
if (true_idx < 0).any() or (pred_idx < 0).any():
raise ValueError("labels must contain all values appearing in y_true and y_pred")
k = labels.size
cm = np.zeros((k, k), dtype=int)
np.add.at(cm, (true_idx, pred_idx), 1)
return cm, labels
def mcc_from_counts(tp, tn, fp, fn):
tp = np.asarray(tp, dtype=float)
tn = np.asarray(tn, dtype=float)
fp = np.asarray(fp, dtype=float)
fn = np.asarray(fn, dtype=float)
num = tp * tn - fp * fn
denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # define MCC = 0 where the denominator is zero (constant predictions or labels);
    # np.divide with `where=` avoids the spurious "invalid value" RuntimeWarning
    return np.divide(num, denom, out=np.zeros_like(num), where=denom != 0)
def matthews_corrcoef_np(y_true, y_pred, labels=None) -> float:
cm, _ = confusion_matrix_np(y_true, y_pred, labels=labels)
t_sum = cm.sum(axis=1, dtype=float) # true per class
p_sum = cm.sum(axis=0, dtype=float) # predicted per class
s = float(cm.sum())
c = float(np.trace(cm))
num = c * s - float(np.dot(t_sum, p_sum))
denom = np.sqrt((s**2 - float(np.dot(p_sum, p_sum))) * (s**2 - float(np.dot(t_sum, t_sum))))
return 0.0 if denom == 0.0 else num / denom
def confusion_counts_binary(y_true, y_pred, positive_label=1):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
true_pos = y_true == positive_label
pred_pos = y_pred == positive_label
tp = int(np.sum(true_pos & pred_pos))
tn = int(np.sum(~true_pos & ~pred_pos))
fp = int(np.sum(~true_pos & pred_pos))
fn = int(np.sum(true_pos & ~pred_pos))
return tp, tn, fp, fn
# Quick sanity check vs scikit-learn
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
cm, labels = confusion_matrix_np(y_true, y_pred)
tp, tn, fp, fn = confusion_counts_binary(y_true, y_pred, positive_label=1)
print("labels:", labels)
print("confusion matrix (rows=true, cols=pred):\n", cm)
print("TP,TN,FP,FN:", tp, tn, fp, fn)
print("MCC (scratch, KxK):", matthews_corrcoef_np(y_true, y_pred))
print("MCC (scratch, binary counts):", float(mcc_from_counts(tp, tn, fp, fn)))
print("MCC (sklearn):", sk_matthews_corrcoef(y_true, y_pred))
labels: [0 1]
confusion matrix (rows=true, cols=pred):
[[4 1]
[1 2]]
TP,TN,FP,FN: 2 4 1 1
MCC (scratch, KxK): 0.4666666666666667
MCC (scratch, binary counts): 0.4666666666666667
MCC (sklearn): 0.4666666666666667
4.1 Multiclass sanity check#
MCC supports multiclass via the confusion-matrix generalization. We’ll verify our NumPy implementation against scikit-learn on a simple 3-class example.
# Multiclass sanity check (K=3)
y_true_mc = rng.integers(0, 3, size=500)
y_pred_mc = y_true_mc.copy()
noise = rng.random(size=y_true_mc.size) < 0.25
# if noisy: replace with a random label in {0,1,2}
y_pred_mc[noise] = rng.integers(0, 3, size=int(noise.sum()))
mcc_mc = matthews_corrcoef_np(y_true_mc, y_pred_mc)
cm_mc, labels_mc = confusion_matrix_np(y_true_mc, y_pred_mc)
fig = px.imshow(
cm_mc,
x=[f"pred {l}" for l in labels_mc],
y=[f"true {l}" for l in labels_mc],
text_auto=True,
color_continuous_scale="Blues",
)
fig.update_layout(title=f"Multiclass confusion matrix (MCC={mcc_mc:.3f})")
fig.show()
print("MCC multiclass (scratch):", mcc_mc)
print("MCC multiclass (sklearn):", sk_matthews_corrcoef(y_true_mc, y_pred_mc))
MCC multiclass (scratch): 0.759901798558076
MCC multiclass (sklearn): 0.759901798558076
5) Intuition plots#
5.1 MCC as a function of TPR and TNR#
If we fix the class prevalence \(\pi = P(Y=1)\) and imagine a classifier with some \((\text{TPR},\text{TNR})\), the expected confusion counts (for large \(N\)) are:

\[
\text{TP} = N\pi\,\text{TPR},\quad
\text{FN} = N\pi(1-\text{TPR}),\quad
\text{TN} = N(1-\pi)\,\text{TNR},\quad
\text{FP} = N(1-\pi)(1-\text{TNR})
\]
Plotting MCC over \((\text{TPR},\text{TNR})\) shows how both kinds of mistakes affect the score.
def plot_mcc_surface(pi: float, grid_steps: int = 101, title: str | None = None):
t = np.linspace(0.0, 1.0, grid_steps)
tpr, tnr = np.meshgrid(t, t, indexing="xy")
n = 1.0 # scale cancels out in MCC
tp = n * pi * tpr
fn = n * pi * (1 - tpr)
tn = n * (1 - pi) * tnr
fp = n * (1 - pi) * (1 - tnr)
z = mcc_from_counts(tp, tn, fp, fn)
fig = px.imshow(
z,
x=t,
y=t,
origin="lower",
aspect="auto",
zmin=-1,
zmax=1,
color_continuous_scale="RdBu",
labels={"x": "TPR (recall)", "y": "TNR (specificity)", "color": "MCC"},
)
fig.add_trace(
go.Scatter(x=t, y=t, mode="lines", name="TPR = TNR", line=dict(color="black", dash="dash"))
)
fig.update_layout(
title=title or f"MCC surface over (TPR, TNR) with prevalence π={pi:.2f}",
coloraxis_colorbar=dict(title="MCC"),
)
return fig
fig = plot_mcc_surface(pi=0.10)
fig.show()
5.2 The “accuracy trap” under imbalance#
Consider a dataset where the positive class is rare. A trivial classifier that always predicts the negative class can achieve very high accuracy, even though it is useless.
MCC exposes this: constant predictions lead to an MCC of 0.
prevalence = np.linspace(0.001, 0.999, 300)
acc_always_negative = 1.0 - prevalence
mcc_always_negative = np.zeros_like(prevalence)
balanced_acc_always_negative = np.full_like(prevalence, 0.5)
fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalence, y=acc_always_negative, name="accuracy (always predict 0)"))
fig.add_trace(go.Scatter(x=prevalence, y=balanced_acc_always_negative, name="balanced accuracy (always 0)", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=prevalence, y=mcc_always_negative, name="MCC (always 0)", line=dict(dash="dot")))
fig.update_layout(
title="Imbalance demo: accuracy can look great while MCC stays 0",
xaxis_title="Positive prevalence π = P(Y=1)",
yaxis_title="Metric value",
yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()
6) Using MCC for optimization: threshold tuning for logistic regression#
MCC is not differentiable with respect to model parameters (it depends on discrete labels), so we typically:
train a probabilistic model with a smooth loss (e.g. log-loss)
choose a decision threshold (or hyperparameters) that maximizes MCC on a validation set
Below is a minimal from-scratch logistic regression and an MCC-based threshold selection.
def add_intercept(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def sigmoid(z):
z = np.asarray(z, dtype=float)
out = np.empty_like(z, dtype=float)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
ez = np.exp(z[~pos])
out[~pos] = ez / (1.0 + ez)
return out
def binary_log_loss(y_true, p, eps: float = 1e-15) -> float:
y_true = np.asarray(y_true, dtype=float)
p = np.asarray(p, dtype=float)
p = np.clip(p, eps, 1.0 - eps)
return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))
def standardize_fit(X_train: np.ndarray):
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
sigma = np.where(sigma == 0, 1.0, sigma)
return mu, sigma
def standardize_apply(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
return (X - mu) / sigma
def fit_logistic_regression_gd(
X: np.ndarray,
y: np.ndarray,
lr: float = 0.1,
n_iter: int = 2000,
l2: float = 0.0,
):
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
n, d = X.shape
w = np.zeros(d)
losses = np.empty(n_iter)
for i in range(n_iter):
p = sigmoid(X @ w)
# average log-loss + L2 (skip intercept)
losses[i] = binary_log_loss(y, p) + 0.5 * l2 * float(np.dot(w[1:], w[1:]))
grad = (X.T @ (p - y)) / n
grad[1:] += l2 * w[1:]
w -= lr * grad
return w, losses
def safe_div(num, denom):
num = np.asarray(num, dtype=float)
denom = np.asarray(denom, dtype=float)
    # define x / 0 = 0 without triggering a divide RuntimeWarning
    return np.divide(num, denom, out=np.zeros_like(num), where=denom != 0)
def binary_metrics_from_counts(tp, tn, fp, fn):
tp = np.asarray(tp, dtype=float)
tn = np.asarray(tn, dtype=float)
fp = np.asarray(fp, dtype=float)
fn = np.asarray(fn, dtype=float)
acc = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)
tpr = recall
tnr = safe_div(tn, tn + fp)
bal_acc = 0.5 * (tpr + tnr)
mcc = mcc_from_counts(tp, tn, fp, fn)
return {
"mcc": mcc,
"accuracy": acc,
"balanced_accuracy": bal_acc,
"precision": precision,
"recall": recall,
"f1": f1,
}
# Synthetic, imbalanced dataset
X, y = make_classification(
n_samples=4000,
n_features=6,
n_informative=4,
n_redundant=0,
n_clusters_per_class=2,
weights=[0.90, 0.10],
class_sep=1.2,
flip_y=0.02,
random_state=7,
)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7)
mu, sigma = standardize_fit(X_train)
X_train_s = standardize_apply(X_train, mu, sigma)
X_val_s = standardize_apply(X_val, mu, sigma)
X_test_s = standardize_apply(X_test, mu, sigma)
X_train_i = add_intercept(X_train_s)
X_val_i = add_intercept(X_val_s)
X_test_i = add_intercept(X_test_s)
w, losses = fit_logistic_regression_gd(X_train_i, y_train, lr=0.15, n_iter=2500, l2=1e-2)
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(losses.size), y=losses, name="train loss"))
fig.update_layout(title="From-scratch logistic regression (GD): training loss", xaxis_title="iteration", yaxis_title="loss")
fig.show()
# Threshold sweep on validation set
p_val = sigmoid(X_val_i @ w)
thresholds = np.linspace(0.0, 1.0, 401)
y_val_bool = y_val.astype(bool)
pred_pos = p_val[:, None] >= thresholds[None, :]
tp = np.sum(pred_pos & y_val_bool[:, None], axis=0)
fp = np.sum(pred_pos & ~y_val_bool[:, None], axis=0)
fn = np.sum(~pred_pos & y_val_bool[:, None], axis=0)
tn = np.sum(~pred_pos & ~y_val_bool[:, None], axis=0)
metrics = binary_metrics_from_counts(tp, tn, fp, fn)
best_idx = int(np.argmax(metrics["mcc"]))
best_t = float(thresholds[best_idx])
best_acc_idx = int(np.argmax(metrics["accuracy"]))
best_acc_t = float(thresholds[best_acc_idx])
best_t, best_acc_t
(0.305, 0.45)
# Plot how metrics change with the decision threshold
fig = go.Figure()
for name, values in metrics.items():
fig.add_trace(go.Scatter(x=thresholds, y=values, name=name))
fig.add_vline(x=0.5, line_dash="dot", line_color="gray", annotation_text="t=0.5")
fig.add_vline(
x=best_acc_t,
line_dash="dash",
line_color="gray",
annotation_text=f"best accuracy t={best_acc_t:.3f}",
)
fig.add_vline(
x=best_t,
line_dash="dash",
line_color="black",
annotation_text=f"best MCC t={best_t:.3f}",
)
fig.update_layout(
title="Validation curves vs threshold",
xaxis_title="threshold",
yaxis_title="metric value",
yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()
# Evaluate on the test set and compare thresholds
p_test = sigmoid(X_test_i @ w)
def metrics_at_threshold(t: float):
y_pred = (p_test >= t).astype(int)
tp, tn, fp, fn = confusion_counts_binary(y_test, y_pred, positive_label=1)
m = binary_metrics_from_counts(tp, tn, fp, fn)
return {k: float(v) for k, v in m.items()}, y_pred
for t in [0.5, best_acc_t, best_t]:
m, _ = metrics_at_threshold(t)
print(
f"t={t:.3f} | MCC={m['mcc']:.3f} | acc={m['accuracy']:.3f} | bal_acc={m['balanced_accuracy']:.3f} | F1={m['f1']:.3f}"
)
# Confusion matrix for the MCC-optimal threshold
m_best, y_test_pred = metrics_at_threshold(best_t)
cm_test, _ = confusion_matrix_np(y_test, y_test_pred, labels=np.array([0, 1]))
fig = px.imshow(
cm_test,
x=["pred 0", "pred 1"],
y=["true 0", "true 1"],
text_auto=True,
color_continuous_scale="Blues",
)
fig.update_layout(title=f"Test confusion matrix (threshold={best_t:.3f}, MCC={m_best['mcc']:.3f})")
fig.show()
print("Test MCC (scratch):", m_best["mcc"])
print("Test MCC (sklearn):", sk_matthews_corrcoef(y_test, y_test_pred))
t=0.500 | MCC=0.578 | acc=0.931 | bal_acc=0.706 | F1=0.567
t=0.450 | MCC=0.601 | acc=0.934 | bal_acc=0.733 | F1=0.607
t=0.305 | MCC=0.567 | acc=0.921 | bal_acc=0.767 | F1=0.609
Test MCC (scratch): 0.5667784612933667
Test MCC (sklearn): 0.5667784612933667
7) Pros, cons, and when to use MCC#
Pros#
Uses all four confusion-matrix entries (TP, TN, FP, FN), unlike precision and recall, which both ignore TN
Robust under class imbalance (unlike accuracy)
Symmetric: swapping the positive and negative labels does not change the value (see the sketch after this list)
Single interpretable scale (\([-1,1]\)) with a correlation meaning
Works for multiclass via the confusion-matrix generalization
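A small sketch of the symmetry claim (synthetic labels, scikit-learn metrics): relabeling which class is "positive" leaves MCC unchanged, while F1 changes because it is anchored to the positive class.

import numpy as np
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef, f1_score

rng_sym = np.random.default_rng(2)
y_true_sym = (rng_sym.random(1000) < 0.10).astype(int)                          # ~10% positives
y_pred_sym = np.where(rng_sym.random(1000) < 0.85, y_true_sym, 1 - y_true_sym)  # noisy predictions

# swap which class is called "positive" in both vectors
y_true_sw, y_pred_sw = 1 - y_true_sym, 1 - y_pred_sym

print("MCC:", sk_matthews_corrcoef(y_true_sym, y_pred_sym), sk_matthews_corrcoef(y_true_sw, y_pred_sw))  # identical
print("F1 :", f1_score(y_true_sym, y_pred_sym), f1_score(y_true_sw, y_pred_sw))                          # differ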
Cons / caveats#
Can be undefined when predictions (or labels) are constant; commonly returned as 0 by convention
Non-differentiable w.r.t. model parameters → not a direct gradient-descent loss
Threshold-dependent for probabilistic models; you often need threshold tuning
Can be noisy/unstable with very small sample sizes or extremely rare classes
When MCC shines#
Imbalanced binary classification where both error types matter (FP and FN)
Model selection and threshold tuning when you want a single score that “respects” the full confusion matrix
Domains with strong imbalance and asymmetric costs where accuracy is misleading (bioinformatics, fraud, anomaly-ish settings)
8) Exercises + references#
Exercises#
Compute MCC by hand for a few confusion matrices and interpret the sign.
Implement a multiclass demo: generate \(K=3\) labels, perturb predictions, and verify your MCC matches scikit-learn.
On the logistic regression demo above:
compare the threshold that maximizes accuracy vs MCC
try a more imbalanced dataset (e.g. 99/1) and re-run the threshold sweep
Implement cross-validated model selection where the chosen hyperparameter maximizes validation MCC.
References#
Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure, 405(2), 442–451.
scikit-learn docs:
sklearn.metrics.matthews_corrcoef
The phi coefficient (binary correlation) and its relationship to MCC